As noted in [1], one major concern with subjective annotation is that the annotations different workers provide for the same image may not be reliable, which calls for a consistency analysis of the annotations. We measure consistency with Spearman’s rank correlation ρ between pairs of workers and estimate p-values to assess the statistical significance of each correlation against a null hypothesis of uncorrelated responses. We use the Benjamini-Hochberg procedure to control the false discovery rate (FDR) across multiple comparisons [2]. At an FDR level of 0.05, 98.45% of batches show significant agreement among raters. Further consistency analysis of the dataset can be found in the supplementary material of [1].
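
The analysis described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in [1]: the `batch` data and all function names are hypothetical, and because the text does not say how the p-values were estimated, the sketch assumes a permutation test (a library routine such as `scipy.stats.spearmanr` would be the more usual choice). The Benjamini-Hochberg step follows the standard step-up rule.

```python
import random
from itertools import combinations


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value under the null of uncorrelated responses.

    Assumption: a permutation test stands in for whatever estimator
    the original analysis used.
    """
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_rho(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values, fdr=0.05):
    """Return one reject flag per p-value using the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for k, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * k / m:
            largest_k = k  # keep the largest rank that passes its threshold
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical batch: each worker's scores for the same eight images.
batch = {
    "worker_a": [1, 2, 3, 4, 5, 4, 2, 1],
    "worker_b": [2, 2, 3, 5, 5, 4, 1, 1],
    "worker_c": [1, 3, 3, 4, 4, 5, 2, 2],
}
pairs = list(combinations(batch, 2))
p_values = [permutation_p_value(batch[a], batch[b]) for a, b in pairs]
for (a, b), significant in zip(pairs, benjamini_hochberg(p_values)):
    print(a, b, round(spearman_rho(batch[a], batch[b]), 3), significant)
```

Each worker pair contributes one p-value, and the Benjamini-Hochberg step then decides which pairs count as significantly correlated at the chosen FDR level. How pairwise results are aggregated into a per-batch verdict is not specified here, so the sketch stops at the pair level.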
[1] Kong, Shu, et al. “Photo aesthetics ranking network with attributes and content adaptation.” European Conference on Computer Vision. Springer, Cham, 2016.
[2] Benjamini, Yoav, and Daniel Yekutieli. “The control of the false discovery rate in multiple testing under dependency.” Annals of Statistics (2001): 1165-1188.